Abstract

We study the problem of estimating, in the sense of optimal transport 
metrics, a measure which is assumed supported on a manifold embed- 
ded in a Hilbert space. By establishing a precise connection between 
optimal transport metrics, optimal quantization, and learning theory, we 
derive new probabilistic bounds for the performance of a classic algorithm 
in unsupervised learning (k-means), when used to produce a probability 
measure derived from the data. In the course of the analysis, we arrive at 
new lower bounds, as well as probabilistic upper bounds on the conver- 
gence rate of the empirical law of large numbers, which, unlike existing 
bounds, are applicable to a wide class of measures. 

1 Introduction and Motivation 

In this paper we study the problem of learning from random samples a proba- 
bility distribution supported on a manifold, when the learning error is measured 
using transportation metrics. 

The problem of learning a probability distribution is classic in statistics and 
machine learning, and is typically analyzed for distributions in $\mathcal{X} = \mathbb{R}^d$ that
have a density with respect to the Lebesgue measure, with total variation and
$L^2$ among the common distances used to measure closeness of two densities (see
for instance [TUJ [35] and references therein). The setting in which the data
distribution is supported on a low dimensional manifold embedded in a high
dimensional space has only been considered more recently. In particular, kernel
density estimators on manifolds have been described in [35], and their pointwise
consistency, as well as convergence rates, have been studied in [551 H31 H]. A
discussion on several topics related to statistics on a Riemannian manifold can 
be found in [26] . 

In this paper, we consider the problem of estimating, in the 2-Wasserstein
sense, a distribution supported on a manifold embedded in a Hilbert space.
The exact formulation of the problem, as well as a detailed discussion of related
previous works, are given in Section 2.

Interestingly, the problem of approximating measures with respect to
transportation distances has deep connections with the fields of optimal
quantization [14, 16], optimal transport [34] and, as we point out in this work, with
unsupervised learning (see Sec. 4). In fact, as described in the sequel, some
of the most widely used algorithms for unsupervised learning, such as k-means
(but also others such as PCA and k-flats), can be shown to be performing
exactly the task of estimating the data-generating measure in the sense of the
2-Wasserstein distance. This close relation between learning theory, optimal
transport, and quantization seems novel and of interest in its own right.
Indeed, in this work, techniques from the above three fields are used to derive
the new probabilistic bounds described below.

Our technical contribution can be summarized as follows: 

(a) we prove uniform lower bounds for the distance between a measure and 
estimates based on discrete sets (such as the empirical measure or measures 
derived from algorithms such as k-means); 

(b) we provide new probabilistic bounds for the rate of convergence of the em- 
pirical law of large numbers which, unlike existing probabilistic bounds, 
hold for a very large class of measures; 

(c) we provide probabilistic bounds for the rate of convergence of measures 
derived from k-means to the data measure. 

The structure of the paper is described at the end of Section 2, where we
discuss the exact formulation of the problem as well as related previous works. 

2 Setup and Previous work 

Consider the problem of learning a probability measure $\rho$ defined on a space $\mathcal{M}$,
from an i.i.d. sample $X_n = (x_1, \ldots, x_n) \sim \rho^n$ of size $n$. We assume $\mathcal{M}$ to be a
compact, smooth $d$-dimensional manifold with $C^1$ metric and volume measure
$\lambda_{\mathcal{M}}$, embedded in the unit ball of a separable Hilbert space $\mathcal{X}$ with inner product
$\langle \cdot, \cdot \rangle$, induced norm $\|\cdot\|$, and distance $d$ (for instance, $\mathcal{M} = B_2^d(1)$, the unit ball
in $\mathcal{X} = \mathbb{R}^d$). Following [331 p. 94], let $\mathcal{P}_p(\mathcal{M})$ denote the Wasserstein space of
order $1 \le p < \infty$:

$$\mathcal{P}_p(\mathcal{M}) := \left\{ \rho \in \mathcal{P}(\mathcal{M}) : \int_{\mathcal{M}} \|x\|^p \, d\rho(x) < \infty \right\}$$

of probability measures with finite $p$-th moment. The $p$-Wasserstein distance

$$W_p(\rho, \mu) = \inf_{X, Y} \left\{ \left[ \mathbb{E} \|X - Y\|^p \right]^{1/p} : \mathrm{Law}(X) = \rho, \ \mathrm{Law}(Y) = \mu \right\} \qquad (1)$$

where the inf is over random variables $X, Y$ with laws $\rho, \mu$, respectively, is
the optimal expected cost of transporting points generated from $\rho$ to those
generated from $\mu$, and is guaranteed to be finite in $\mathcal{P}_p(\mathcal{M})$ [3H p. 95]. The space
$\mathcal{P}_p(\mathcal{M})$ with the $W_p$ metric is itself a complete separable metric space [31]. We
consider here the problem of learning probability measures $\rho \in \mathcal{P}_2(\mathcal{M})$, where
the performance is measured by the distance $W_2(\rho, \cdot)$.
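As a small worked instance of definition (1), consider measures made of point masses (the points $a, b, c \in \mathcal{M}$ and the two-atom measure below are hypothetical choices made only for illustration; $\rho, \mu$ denote the two measures being compared). When one of the two measures is a single Dirac mass, only one coupling exists, so the infimum in (1) is trivial:

```latex
% Two Dirac masses: the only coupling moves all mass from a to b.
W_p(\delta_a, \delta_b) = \|a - b\|.

% A Dirac mass against a two-atom measure: half of the mass at a
% travels to b and the other half travels to c.
\rho = \delta_a, \qquad \mu = \tfrac{1}{2}\delta_b + \tfrac{1}{2}\delta_c
\quad \Longrightarrow \quad
W_p(\rho, \mu)^p = \tfrac{1}{2}\|a - b\|^p + \tfrac{1}{2}\|a - c\|^p .
```

In both cases the distance is finite because the atoms lie in the unit ball, in line with the finiteness statement above.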

There are many possible choices of distances between probability measures [13].
Among them, $W_p$ metrizes weak convergence (see [31] theorem 6.9), that is, in
$\mathcal{P}_p(\mathcal{M})$, a sequence $(\mu_i)_{i \in \mathbb{N}}$ of measures converges weakly to $\mu$ iff $W_p(\mu_i, \mu) \to 0$.
There are other distances, such as the Lévy-Prokhorov, or the weak-* distance,
that also metrize weak convergence. However, as pointed out by Villani in his
excellent monograph [3H p. 98],

1. "Wasserstein distances are rather strong, [...] a definite advantage over the
weak-* distance".

2. "It is not so difficult to combine information on convergence in Wasserstein 
distance with some smoothness bound, in order to get convergence in 
stronger distances." 

Wasserstein distances have been used to study the mixing and convergence of
Markov chains [55], as well as concentration of measure phenomena [50]. To this
list we would add the important fact that existing and widely used algorithms
for unsupervised learning can be easily extended (see Sec. 4) to compute a
measure $\rho'$ that minimizes the distance $W_2(\rho_n, \rho')$ to the empirical measure

$$\rho_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i},$$

a fact that will allow us to prove, in Sec. 5, bounds on the convergence of the
measure induced by k-means to the population measure $\rho$.

The most useful versions of Wasserstein distance are $p = 1, 2$, with $p = 1$
being the weaker of the two (by Hölder's inequality, $p \le q \Rightarrow W_p \le W_q$; a
discussion of $p = \infty$ would take us off topic, since its behavior is markedly
different). In particular, "results in $W_2$ distance are usually stronger, and more
difficult to establish than results in $W_1$ distance" [3H p. 95].

2.1 Closeness of Empirical and Population Measures 

By the empirical law of large numbers, the empirical measure converges almost
surely to the population measure: $\rho_n \to \rho$ in the sense of the weak topology [33].
Since weak convergence and convergence in $W_p$ are equivalent in $\mathcal{P}_p(\mathcal{M})$, this
means that, in the $W_p$ sense, the empirical measure $\rho_n$ is an arbitrarily good
approximation of $\rho$, as $n \to \infty$. A fundamental question is therefore how fast
$\rho_n$ converges to $\rho$.

2.1.1 Convergence in expectation 

The mean rate of convergence of $\rho_n \to \rho$ has been widely studied in the past,
resulting in upper bounds of order $\mathbb{E} W_2(\rho, \rho_n) = O(n^{-1/(d+2)})$ [HI [8], and lower
bounds of order $\mathbb{E} W_2(\rho, \rho_n) = \Omega(n^{-1/d})$ (both assuming that the absolutely
continuous part of $\rho$ is $\rho_A \neq 0$, with possibly better rates otherwise).

More recently, an upper bound of order $\mathbb{E} W_p(\rho, \rho_n) = O(n^{-1/d})$ has been
proposed [2] by proving a bound for the Optimal Bipartite Matching (OBM)
problem pQ, and relating this problem to the expected distance $\mathbb{E} W_p(\rho, \rho_n)$. In
particular, given two independent samples $X_n, Y_n$, the OBM problem is that
of finding a permutation $\sigma$ that minimizes the matching cost
$n^{-1} \sum_{i} \|x_i - y_{\sigma(i)}\|^p$ [211 150]. It is not hard to show that the optimal matching cost is
$W_p(\rho_{X_n}, \rho_{Y_n})^p$, where $\rho_{X_n}, \rho_{Y_n}$ are the empirical measures associated to $X_n, Y_n$.
By Jensen's inequality, the triangle inequality, and $(a + b)^p \le 2^{p-1}(a^p + b^p)$, it
holds

$$\mathbb{E} W_p(\rho, \rho_n)^p \le \mathbb{E} W_p(\rho_{X_n}, \rho_{Y_n})^p \le 2^{p-1} \, \mathbb{E} W_p(\rho, \rho_n)^p,$$

and therefore a bound of order $O(n^{-p/d})$ for the OBM problem [5] implies a
bound $\mathbb{E} W_p(\rho, \rho_n) = O(n^{-1/d})$. The matching lower bound is only known for
a special case: $\rho_A$ constant over a bounded set of non-null measure [2] (e.g. $\rho_A$
uniform). Similar results, with matching lower bounds, are found for $W_1$ in
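The identification of the optimal matching cost with a Wasserstein distance between empirical measures can be checked numerically in a toy case. The sketch below is only an illustration under stated assumptions: it restricts to samples on the real line, where pairing sorted points is known to be optimal for convex costs ($p \ge 1$), and compares that pairing with a brute-force search over all permutations. The function names and the sample values are hypothetical.

```python
from itertools import permutations

def matching_cost(xs, ys, sigma, p):
    """n^{-1} * sum_i |x_i - y_sigma(i)|^p for a given permutation sigma."""
    n = len(xs)
    return sum(abs(xs[i] - ys[sigma[i]]) ** p for i in range(n)) / n

def optimal_matching_brute_force(xs, ys, p):
    """Minimum matching cost over all n! permutations (feasible for small n only)."""
    return min(matching_cost(xs, ys, s, p) for s in permutations(range(len(xs))))

def sorted_matching(xs, ys, p):
    """Cost of pairing the i-th smallest x with the i-th smallest y (1D only)."""
    return sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Two hypothetical samples of size n = 4 on the real line.
xs = [0.9, 0.1, 0.4, 0.7]
ys = [0.3, 0.8, 0.2, 0.6]
brute = optimal_matching_brute_force(xs, ys, p=2)
fast = sorted_matching(xs, ys, p=2)
print(brute, fast)  # the two costs agree up to floating-point rounding
```

In higher dimensions no sorting shortcut is available and the OBM problem requires a general assignment solver; the brute-force search here is for verification only.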

2.1.2 Convergence in probability 

Results for convergence in probability, which are among the main results of this work,
appear to be considerably harder to obtain. One fruitful avenue of analysis has
been the use of so-called transportation, or Talagrand, inequalities $T_p$, which can
be used to prove concentration inequalities on $W_p$ [20]. In particular, we say that
$\rho$ satisfies a $T_p(C)$ inequality with $C > 0$ iff $W_p(\rho, \mu)^2 \le C H(\mu | \rho)$, $\forall \mu \in \mathcal{P}_p(\mathcal{M})$,
where $H(\cdot | \cdot)$ is the relative entropy [20]. As shown in [5J[S], it is possible to
obtain probabilistic upper bounds on $W_p(\rho, \rho_n)$, with $p = 1, 2$, if $\rho$ is known
to satisfy a $T_p$ inequality of the same order, thereby reducing the problem of
bounding $W_p(\rho, \rho_n)$ to that of obtaining a $T_p$ inequality. Note that, by Jensen's
inequality, and as expected from the behavior of $W_p$, the inequality $T_2$ is stronger
than $T_1$ [20].

While it has been shown that $\rho$ satisfies a $T_1$ inequality iff it has a finite
square-exponential moment [H [7], no such general conditions have been found
for $T_2$. As an example, consider that, if $\mathcal{M}$ is compact with diameter $D$ then,
by theorem 6.15 of [34], and the celebrated Csiszár-Kullback-Pinsker
inequality [27], for all $\rho, \mu \in \mathcal{P}_p(\mathcal{M})$, it holds that

$$W_p(\rho, \mu)^{2p} \le (2D)^{2p} \|\rho - \mu\|_{TV}^2 \le 2^{2p-1} D^{2p} H(\mu | \rho),$$

where $\|\cdot\|_{TV}$ is the total variation norm. Clearly, this implies a $T_{p=1}$ inequality,
but for $p \ge 2$ it does not.
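To see why the last claim holds, it may help to specialize the displayed bound to $p = 1$ and $p = 2$. The following is a short worked reading of that bound (with $\rho, \mu$ the two measures and $H(\mu | \rho)$ the relative entropy), not an additional result:

```latex
% p = 1: the bound is linear in the relative entropy, i.e. a T_1(2 D^2) inequality.
W_1(\rho, \mu)^2 \le 2 D^2 \, H(\mu | \rho).

% p = 2: the bound only controls the fourth power of W_2,
W_2(\rho, \mu)^4 \le 8 D^4 \, H(\mu | \rho)
\quad \Longleftrightarrow \quad
W_2(\rho, \mu)^2 \le 2\sqrt{2} \, D^2 \sqrt{H(\mu | \rho)},
% which is weaker than the linear dependence W_2^2 \le C H required by T_2,
% since \sqrt{H} is much larger than H when H is small.
```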

The $T_2$ inequality has been shown by Talagrand to be satisfied by the Gaussian
distribution [31 , and then slightly more generally by strictly log-concave
measures [3J. However, as noted in [0J, "contrary to the $T_1$ case, there is no
hope to obtain $T_2$ inequalities from just integrability or decay estimates."

Structure of this paper. In this work we obtain bounds in probability (learning
rates) for the problem of learning a probability measure in the sense of $W_2$.
We begin by establishing (lower) bounds for the convergence of empirical to
population measures, which serve to set up the problem and introduce the
connection between quantization and measure learning (Sec. 3). We then describe
how existing unsupervised learning algorithms that compute a set (k-means,
k-flats, PCA, ...) can be easily extended to produce a measure (Sec. 4). Due
to its simplicity and widespread use, we focus here on k-means. Since the two
measure estimates that we consider are the empirical measure, and the measure
induced by k-means, we next set out to prove upper bounds on their convergence
to the data-generating measure (Sec. 5). We arrive at these bounds by means of
intermediate measures, which are related to the problem of optimal
quantization. The bounds apply in a very broad setting (unlike existing bounds based
on transportation inequalities, they are not restricted to log-concave measures).

3 Learning probability measures, optimal transport and quantization

We address the problem of learning a probability measure $\rho$ when the only
observation we have at our disposal is an i.i.d. sample $X_n$. We begin by establishing
some notation and useful intermediate results.

Given a closed set $S \subseteq \mathcal{M}$, let $\pi_S = \sum_{q \in S} 1_{V_q(S)} \cdot q$ be a nearest neighbor
projection onto $S$ (a function mapping points in $\mathcal{X}$ to their closest point in $S$),
where $\{V_q(S) : q \in S\}$ is a Borel Voronoi partition of $\mathcal{X}$ such that
$V_q(S) \subseteq \{x \in \mathcal{X} : \|x - q\| = \min_{r \in S} \|x - r\|\}$ (see for instance [E]). Since $S$ is closed and
$\|x - \cdot\|$ is continuous and convex, every point $x \in \mathcal{X}$ has a closest point in $S$.
Since $\{V_q(S) : q \in S\}$ is a Borel partition, it follows that $\pi_S$ is a measurable
map. For any $\rho \in \mathcal{P}_p(\mathcal{M})$, the pushforward, or image measure $\pi_S \rho$ under the
mapping $\pi_S$ is supported in $S$, and is such that, for Borel measurable sets $A$, it
is $(\pi_S \rho)(A) := \rho(\pi_S^{-1}(A))$.

We now establish a connection between the expected distance to a set $S$, and
the distance between $\rho$ and the set's induced pushforward measure. Notice that
the expected distance to $S$ is exactly the expected quantization error incurred
when encoding points drawn from $\rho$ by their closest point in $S$. This close
connection between optimal quantization and Wasserstein distance has been
pointed out in the past in the statistics [28], optimal quantization [21 p. 33],
and approximation theory literatures [15].

The following two lemmas are key tools in the remainder of the paper. The
first highlights the close link between quantization and optimal transport.

Lemma 3.1. For closed $S \subseteq \mathcal{M}$, $\rho \in \mathcal{P}_p(\mathcal{M})$, $1 \le p < \infty$, it holds
$\mathbb{E}_{x \sim \rho} \, d(x, S)^p = W_p(\rho, \pi_S \rho)^p$.

Note that the key element in the above lemma is that the two measures in
the expression $W_p(\rho, \pi_S \rho)$ must match. When there is a mismatch, the distance
can only increase. That is, $W_p(\rho, \pi_S \mu) \ge W_p(\rho, \pi_S \rho)$ for all $\mu \in \mathcal{P}_p(\mathcal{M})$. In
fact, the following lemma shows that, among all the measures with support in
$S$, $\pi_S \rho$ is closest to $\rho$.

Lemma 3.2. For closed $S$, and all $\mu \in \mathcal{P}_p(\mathcal{M})$ with supp($\mu$) $\subseteq S$, $1 \le p < \infty$,
it holds $W_p(\rho, \mu) \ge W_p(\rho, \pi_S \rho)$.
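Both lemmas can be checked by hand on a small discrete example. The sketch below is an illustration under stated assumptions, not part of the paper's argument: it takes a hypothetical four-point empirical measure on the real line, uses the monotone (quantile) coupling, which is known to be optimal in one dimension for $p \ge 1$, and all function names are made up for this example.

```python
def wasserstein_pp_1d(atoms_a, atoms_b, p=2):
    """W_p^p between two discrete measures on the real line.

    Each measure is a list of (location, mass) pairs with total mass 1.
    Mass is moved in sorted order (monotone coupling, optimal in 1D for p >= 1).
    """
    a, b = sorted(atoms_a), sorted(atoms_b)
    i = j = 0
    mass_a, mass_b = a[0][1], b[0][1]
    cost = 0.0
    while i < len(a) and j < len(b):
        moved = min(mass_a, mass_b)
        cost += moved * abs(a[i][0] - b[j][0]) ** p
        mass_a -= moved
        mass_b -= moved
        if mass_a <= 1e-12:          # atom of a is exhausted, move to the next one
            i += 1
            mass_a = a[i][1] if i < len(a) else 0.0
        if mass_b <= 1e-12:          # atom of b is exhausted, move to the next one
            j += 1
            mass_b = b[j][1] if j < len(b) else 0.0
    return cost

def push_forward(atoms, S):
    """pi_S applied to a discrete measure: send each atom's mass to its nearest point of S."""
    out = {}
    for x, m in atoms:
        q = min(S, key=lambda s: abs(x - s))
        out[q] = out.get(q, 0.0) + m
    return sorted(out.items())

def quantization_cost(atoms, S, p=2):
    """E d(x, S)^p under the discrete measure given by atoms."""
    return sum(m * min(abs(x - s) for s in S) ** p for x, m in atoms)

# Hypothetical data: four atoms of mass 1/4 and the set S = {0, 1}.
rho_n = [(0.1, 0.25), (0.3, 0.25), (0.6, 0.25), (0.9, 0.25)]
S = [0.0, 1.0]
projected = push_forward(rho_n, S)        # pi_S rho_n
other = [(0.0, 0.25), (1.0, 0.75)]        # another measure supported on S

lemma_31_lhs = quantization_cost(rho_n, S)
lemma_31_rhs = wasserstein_pp_1d(rho_n, projected)
lemma_32_other = wasserstein_pp_1d(rho_n, other)
```

Here `lemma_31_lhs` and `lemma_31_rhs` coincide, as lemma 3.1 states, and `lemma_32_other` is larger, as lemma 3.2 states for any other measure supported on the same set.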

When combined, lemmas 3.1 and 3.2 indicate that the behavior of the measure
learning problem is limited by the performance of the optimal quantization
problem. For instance, $W_p(\rho, \rho_n)$ can only be, in the best case, as low as the
optimal quantization cost with codebook of size $n$. The following section makes
this claim precise.



3.1 Lower bounds 

Consider the situation depicted in Fig. 1, in which a sample $X_4 = \{x_1, x_2, x_3, x_4\}$
is drawn from a distribution $\rho$ which we assume here to be absolutely continuous
on its support. As shown, the projection map $\pi_{X_4}$ sends points $x$ to their closest
point in $X_4$. The resulting Voronoi decomposition of supp($\rho$) is drawn in shades
of blue. By lemma 5.2 of [5], the pairwise intersections of Voronoi regions have
null ambient measure, and since $\rho$ is absolutely continuous, the pushforward
measure can be written in this case as $\pi_{X_4} \rho = \sum_{j=1}^{4} \rho(V_j) \delta_{x_j}$, where $V_j$ is the
Voronoi region of $x_j$. Note that this decomposition is not always possible: if,
for instance, $\rho$ has an atom falling on two Voronoi regions, both regions would
count the atom as theirs, and double-counting would imply $\sum_j \rho(V_j) > 1$. The
technicalities required to correctly define $\rho(V_j)$ are such that, in general, it is
simpler to write $\pi_S \rho$, even though (if $S$ is discrete) this measure can clearly be
written as a sum of deltas with appropriate masses.



By lemma 3.1, the distance $W_p(\rho, \pi_{X_4} \rho)^p$ is the (expected) quantization cost
of p when using X4 as codebook. Clearly, this cost can never be lower than the 
optimal quantization cost of size 4. This reasoning leads to the following lower 
bound between empirical and population measures. 

Theorem 3.3. For $\rho \in \mathcal{P}_p(\mathcal{M})$ with absolutely continuous part $\rho_A \neq 0$, and
$1 \le p < \infty$, it holds $W_p(\rho, \rho_n) = \Omega(n^{-1/d})$ uniformly over $\rho_n$, where the constants
depend on $d$ and $\rho_A$ only.

Proof: Let $V_{n,p}(\rho) := \inf_{S \subset \mathcal{M}, |S| = n} \mathbb{E}_{x \sim \rho} \, d(x, S)^p$ be the optimal quantization
cost of $\rho$ of order $p$ with $n$ centers. Since $\rho_A \neq 0$, and since $\rho$ has a finite
$(p + \delta)$-th order moment for some $\delta > 0$ (since it is supported on the unit ball),
it holds that $V_{n,p}(\rho) = \Theta(n^{-p/d})$, with constants depending on $d$ and $\rho_A$ (see [21
p. 78] and [IS]). Since supp($\rho_n$) $= X_n$, it follows that

$$W_p(\rho, \rho_n)^p \underset{\text{lemma 3.2}}{\ge} W_p(\rho, \pi_{X_n} \rho)^p \underset{\text{lemma 3.1}}{=} \mathbb{E}_{x \sim \rho} \, d(x, X_n)^p \ge V_{n,p}(\rho) = \Theta(n^{-p/d}). \qquad \square$$



Note that the bound of theorem 3.3 holds for $\rho_n$ derived from any sample
$X_n$, and is therefore stronger than the existing lower bounds on the convergence
rates of $\mathbb{E} W_p(\rho, \rho_n) \to 0$. In particular, it trivially induces the known lower
bound $\Omega(n^{-1/d})$ on the expected rate of convergence.
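The chain of inequalities in the proof can be illustrated in the simplest case $d = 1$, $p = 2$. The sketch below assumes, purely for illustration, that the data measure is uniform on $[0, 1]$, for which the optimal $n$-point quantization cost is known in closed form to be $1/(12 n^2)$ (attained by the midpoints of $n$ equal intervals); the function names are hypothetical. Whatever sample is drawn, its quantization cost stays above this value.

```python
import random

def quantization_cost_uniform(points):
    """Exact E d(x, S)^2 for x ~ Uniform[0, 1] and a finite codebook S inside [0, 1]."""
    s = sorted(points)
    # Mass to the left of the first point and to the right of the last point.
    cost = s[0] ** 3 / 3 + (1 - s[-1]) ** 3 / 3
    # Each gap between neighbours is split at its midpoint: 2 * (g/2)^3 / 3 = g^3 / 12.
    for left, right in zip(s, s[1:]):
        cost += (right - left) ** 3 / 12
    return cost

def optimal_cost_uniform(n):
    """Optimal n-point quantization cost of order 2 for Uniform[0, 1]."""
    return 1.0 / (12 * n ** 2)

random.seed(0)
n = 50
sample = [random.random() for _ in range(n)]             # codebook X_n drawn from the data
midpoints = [(2 * i + 1) / (2 * n) for i in range(n)]    # optimal codebook

sample_cost = quantization_cost_uniform(sample)
best_cost = quantization_cost_uniform(midpoints)
print(sample_cost >= optimal_cost_uniform(n))
```

By lemmas 3.1 and 3.2, `sample_cost` is in turn a lower bound on $W_2(\rho, \rho_n)^2$ for this sample, which is the mechanism behind theorem 3.3.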



The consequence of theorem 3.3 is clearly that the rate of convergence of
the empirical law of large numbers is limited (in all cases) by the dimension of
the space in which $\rho$ is absolutely continuous. This justifies the choice of formal
setting to be a $d$-manifold (or even $\mathbb{R}^d$): by the above uniform lower bound, one
is effectively forced to make a finite-dimension assumption on the space where
$\rho$ is absolutely continuous.

4 Unsupervised learning algorithms for learning 
a probability measure 

As described in [5T], several of the most widely used unsupervised learning
algorithms can be interpreted to take as input a sample $X_n$ and output a set
$\hat{S}_k$, where $k$ is typically a free parameter of the algorithm, such as the number
of means in k-means (footnote 1), the dimension of affine spaces in PCA, etc. Performance
is measured by the empirical quantity $n^{-1} \sum_{i=1}^{n} d(x_i, \hat{S}_k)^2$, which is minimized
among all sets in some class (e.g. sets of size $k$, affine spaces of dimension
$k$, ...). This formulation is general enough to encompass k-means and PCA, but
also k-flats, non-negative matrix factorization, and sparse coding (see [21] and
references therein).

Using the discussion of Sec. 3, we can establish a clear connection between
unsupervised learning and the problem of learning probability measures with
respect to $W_2$. Consider as a running example the k-means problem, though
the argument is general. Given an input $X_n$, the k-means problem is to find a
set $\hat{S}_k$ with $|\hat{S}_k| = k$ minimizing its average distance from points in $X_n$. By associating
to $\hat{S}_k$ the pushforward measure $\pi_{\hat{S}_k} \rho_n$, we find that

$$\frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{S}_k)^2 = \mathbb{E}_{x \sim \rho_n} \, d(x, \hat{S}_k)^2 = W_2(\rho_n, \pi_{\hat{S}_k} \rho_n)^2. \qquad (2)$$

Since k-means minimizes equation (2), it also finds the measure that is closest
to $\rho_n$, among those with support of size $k$. This connection between k-means
and $W_2$ measure approximation was, to the best of the authors' knowledge, first
suggested by Pollard [35] though, as mentioned earlier, the argument carries
over to many other unsupervised learning algorithms.

Footnote 1: In a slight abuse of notation, we refer to the k-means algorithm here as an ideal algorithm
that solves the k-means problem, even though in practice an approximation algorithm may
be used.

We briefly clarify the steps involved in using an existing unsupervised learning
algorithm for probability measure learning. Let $U_k$ be a parametrized
algorithm (e.g. k-means) that takes a sample $X_n$ and outputs a set $U_k(X_n)$. The
measure learning algorithm $A_k : \mathcal{M}^n \to \mathcal{P}_p(\mathcal{M})$ corresponding to $U_k$ is defined
as follows:

1. $A_k$ takes a sample $X_n$ and outputs the measure $\pi_{\hat{S}_k} \rho_n$, supported on
$\hat{S}_k = U_k(X_n)$;

2. since $\rho_n$ is discrete, then so must $\pi_{\hat{S}_k} \rho_n$ be, and thus
$A_k(X_n) = \frac{1}{n} \sum_{i=1}^{n} \delta_{\pi_{\hat{S}_k}(x_i)}$;

3. in practice, we can simply store an $n$-vector $[\pi_{\hat{S}_k}(x_1), \ldots, \pi_{\hat{S}_k}(x_n)]$, from
which $A_k(X_n)$ can be reconstructed by placing atoms of mass $1/n$ at each
point.

In the case that $U_k$ is the k-means algorithm, only $k$ points and $k$ masses need
to be stored.
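The construction of the measure from a set-valued algorithm can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: Lloyd's heuristic stands in for the ideal k-means solver (it is an approximation and may return a suboptimal set), and the names `lloyd_kmeans`, `induced_measure` and `nearest` are hypothetical. The output stores only the support points and their masses, as noted above.

```python
import random

def nearest(x, centers):
    """Index of the center closest to x in squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))

def lloyd_kmeans(sample, k, iters=50, seed=0):
    """Stand-in for the set-valued algorithm: Lloyd's heuristic, not an exact solver."""
    rng = random.Random(seed)
    centers = rng.sample(sample, k)              # initialize with k sample points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in sample:
            groups[nearest(x, centers)].append(x)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def induced_measure(sample, centers):
    """Pushforward of the empirical measure onto the centers, as {center: mass}."""
    measure = {}
    for x in sample:
        c = centers[nearest(x, centers)]
        measure[c] = measure.get(c, 0.0) + 1.0 / len(sample)
    return measure

# Hypothetical data: 200 points in the unit square.
rng = random.Random(1)
sample = [(rng.random(), rng.random()) for _ in range(200)]
measure = induced_measure(sample, lloyd_kmeans(sample, k=5))
for center, mass in measure.items():
    print(center, round(mass, 3))
```

Any other algorithm that returns a finite set could be substituted for `lloyd_kmeans` without changing `induced_measure`.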

Note that any algorithm $A'$ that attempts to output a measure $A'(X_n)$ close
to $\rho_n$ can be cast in the above framework. Indeed, if $S'$ is the support of $A'(X_n)$
then, by lemma 3.2, $\pi_{S'} \rho_n$ is the measure closest to $\rho_n$ with support in $S'$. This
effectively reduces the problem of learning a measure to that of finding a set,
and is akin to how the fact that every optimal quantizer is a nearest-neighbor
quantizer (see [TS], [HI p. 350], and [HI p. 37-38]) reduces the problem of
finding an optimal quantizer to that of finding an optimal quantizing set.

Clearly, the minimum of equation (2) over sets of size $k$ (the output of
k-means) is monotonically non-increasing with $k$. In particular, since $\hat{S}_n = X_n$
and $\pi_{\hat{S}_n} \rho_n = \rho_n$, it is $\mathbb{E}_{x \sim \rho_n} \, d(x, \hat{S}_n)^2 = W_2(\rho_n, \pi_{\hat{S}_n} \rho_n)^2 = 0$. That is, we
can always make the learned measure arbitrarily close to $\rho_n$ by increasing $k$.
However, as pointed out in Sec. 2, the problem of measure learning is concerned
with minimizing the distance $W_2(\rho, \cdot)$ to the data-generating measure. The
actual performance of k-means is thus not necessarily guaranteed to behave
in the same way as the empirical one, and the question of characterizing its
behavior as a function of $k$ and $n$ naturally arises.

Finally, we note that, while it is $\mathbb{E}_{x \sim \rho_n} \, d(x, \hat{S}_k)^2 = W_2(\rho_n, \pi_{\hat{S}_k} \rho_n)^2$ (the
empirical performances are the same in the optimal quantization, and measure
learning problem formulations), the actual performances satisfy

$$\mathbb{E}_{x \sim \rho} \, d(x, \hat{S}_k)^2 \underset{\text{lemma 3.1}}{=} W_2(\rho, \pi_{\hat{S}_k} \rho)^2 \underset{\text{lemma 3.2}}{\le} W_2(\rho, \pi_{\hat{S}_k} \rho_n)^2, \qquad 1 \le k \le n.$$

Consequently, with the identification between sets $S$ and measures $\pi_S \rho_n$, the
set-approximation problem is, in general, different from the measure learning
problem (for example, if $\mathcal{M} = \mathbb{R}^d$ and $\rho$ is absolutely continuous over a set of
non-null volume, it is not hard to show that the inequality is almost surely strict:
$\mathbb{E}_{x \sim \rho} \, d(x, \hat{S}_k)^2 < W_2(\rho, \pi_{\hat{S}_k} \rho_n)^2$ for $n > k > 1$).

In the remainder, we characterize the performance of k-means on the mea- 
sure learning problem, for varying k, n. Although other unsupervised learning 
algorithms could have been chosen as basis for our analysis, k-means is one 
of the oldest and most widely used, and the one for which the deep connec- 
tion between optimal quantization and measure approximation is most clearly 
manifested. Note that, by setting $k = n$, our analysis includes the problem
of characterizing the behavior of the distance $W_2(\rho, \rho_n)$ between empirical and
population measures which, as indicated in Sec. 2.1, is a fundamental question
in statistics (i.e. the speed of convergence of the empirical law of large numbers).
5 Learning rates



In order to analyze the performance of k-means as a measure learning
algorithm, and the convergence of empirical to population measures, we propose
the decomposition shown in Fig. 2. The diagram includes all the measures
considered in the paper, and shows the two decompositions used to prove upper
bounds. The upper arrow (green) illustrates the decomposition used to bound
the distance $W_2(\rho, \rho_n)$. This decomposition uses the measures $\pi_{S_k} \rho$ and $\pi_{S_k} \rho_n$
as intermediates to arrive at $\rho_n$, where $S_k$ is a $k$-point optimal quantizer of $\rho$,
that is, a set $S_k$ minimizing $\mathbb{E}_{x \sim \rho} \, d(x, S)^2$ and such that $|S_k| = k$. The lower
arrow (blue) corresponds to the decomposition of $W_2(\rho, \pi_{\hat{S}_k} \rho_n)$ (the performance
of k-means), whereas the labelled black arrows correspond to individual terms
in the bounds. We begin with the (slightly) simpler of the two results.



Figure 1: A sample $\{x_1, x_2, x_3, x_4\}$ is drawn from a distribution $\rho$ with
support in supp $\rho$. The projection map $\pi_{\{x_1, x_2, x_3, x_4\}}$ sends points $x$ to their
closest one in the sample. The induced Voronoi tiling is shown in shades of blue.

Figure 2: The measures considered in this paper are linked by arrows for which
upper bounds for their distance are derived. Bounds for the quantities of
interest $W_2(\rho, \rho_n)^2$ and $W_2(\rho, \pi_{\hat{S}_k} \rho_n)^2$ are decomposed by following the top
and bottom colored arrows.



5.1 Convergence rates for the empirical law of large numbers

Let $S_k$ be the optimal $k$-point quantizer of $\rho$ of order two [14, p. 31]. By the
triangle inequality and the inequality $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$, it follows that

$$W_2(\rho, \rho_n)^2 \le 3 \left[ W_2(\rho, \pi_{S_k} \rho)^2 + W_2(\pi_{S_k} \rho, \pi_{S_k} \rho_n)^2 + W_2(\pi_{S_k} \rho_n, \rho_n)^2 \right]. \qquad (3)$$

This is the decomposition depicted in the upper arrow of Fig. 2; the three terms
on the right-hand side are referred to as a), b) and c), in that order.

By lemma 3.1, the first term in the sum of equation (3) is the optimal $k$-point
quantization error of $\rho$ over a $d$-manifold $\mathcal{M}$ which, using recent techniques
from [TB] (see also [TTJ, p. 491]), is shown in the proof of theorem 5.1 (part a)
to be of order $\Theta(k^{-2/d})$. The remaining terms, b) and c), are slightly more
technical and are bounded in the proof of theorem 5.1.

Since equation (3) holds for all $1 \le k \le n$, the best bound on $W_2(\rho, \rho_n)$ can be
obtained by optimizing the right-hand side over all possible values of $k$, resulting
in the following probabilistic bound for the rate of convergence of the empirical
law of large numbers.

Theorem 5.1. Given $\rho \in \mathcal{P}_p(\mathcal{M})$ with absolutely continuous part $\rho_A \neq 0$,
sufficiently large $n$, and $0 < \delta < 1$, it holds

$$W_2(\rho, \rho_n) \le C \cdot m(\rho_A) \cdot n^{-1/(2d+4)} \cdot \tau, \quad \text{with probability } 1 - e^{-\tau^2},$$

where $m(\rho_A) := \int_{\mathcal{M}} \rho_A(x)^{d/(d+2)} \, d\lambda_{\mathcal{M}}(x)$, and $C$ depends only on $d$.

Proof. See Appendix. □ 
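The exponent in the theorem can be recovered by a back-of-the-envelope balancing argument. The sketch below is only heuristic and rests on an assumption not stated in this section: that, up to constants and the confidence factor, the statistical terms b) and c) scale like $k / \sqrt{n}$ (their precise form is given in the Appendix). Under that assumption, optimizing over $k$ gives:

```latex
% Assumed form of the right-hand side of (3), constants dropped:
f(k) = \underbrace{k^{-2/d}}_{\text{term a)}} + \underbrace{k \, n^{-1/2}}_{\text{terms b), c) (assumed)}} .

% Balancing the two contributions:
k^{-2/d} = k \, n^{-1/2}
\;\Longrightarrow\; k^{(d+2)/d} = n^{1/2}
\;\Longrightarrow\; k = n^{d/(2d+4)} .

% Substituting back:
f(k) \asymp n^{-1/(d+2)}
\;\Longrightarrow\; W_2(\rho, \rho_n) \lesssim n^{-1/(2d+4)} .
```

Both the resulting rate and the resulting order of $k$ agree with the exponents appearing in theorems 5.1 and 5.2.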

5.2 Learning rates of k-means 



The key element in the proof of theorem 5.1 is that the distance between pop- 
ulation and empirical measures can be bounded by choosing an intermediate 
optimal quantizing measure of an appropriate size k. In the analysis, the best 
bounds are obtained for k smaller than n. If the output of k-means is close to 
an optimal quantizer (for instance if sufficient data is available), then we would 
similarly expect that the best bounds for k-means correspond to a choice of 
k < n. 

The decomposition of the bottom (blue) arrow in Figure 2 leads to the
following bound in probability.

Theorem 5.2. Given $\rho \in \mathcal{P}_p(\mathcal{M})$ with absolutely continuous part $\rho_A \neq 0$, and
$0 < \delta < 1$, then for all sufficiently large $n$, and letting

$$k = C \cdot m(\rho_A) \cdot n^{d/(2d+4)},$$

it holds

$$W_2(\rho, \pi_{\hat{S}_k} \rho_n) \le C \cdot m(\rho_A) \cdot n^{-1/(2d+4)} \cdot \tau, \quad \text{with probability } 1 - e^{-\tau^2},$$

where $m(\rho_A) := \int_{\mathcal{M}} \rho_A(x)^{d/(d+2)} \, d\lambda_{\mathcal{M}}(x)$, and $C$ depends only on $d$.

Proof. See Appendix. □ 



Note that the upper bounds in theorems 5.1 and 5.2 are exactly the same.
Although this may appear surprising, it stems from the following fact. Since
$S = \hat{S}_k$ is a minimizer of $W_2(\pi_S \rho_n, \rho_n)^2$, the bound d) of Figure 2 satisfies:

$$W_2(\pi_{\hat{S}_k} \rho_n, \rho_n)^2 \le W_2(\pi_{S_k} \rho_n, \rho_n)^2$$

and therefore (by the definition of c)), the term d) is of the same order as c).
Since f) is also of the same order as c) (see the proof of theorem 5.2), this
means that, up to a small constant factor, adding the term d) to the bound of
$W_2(\rho, \pi_{\hat{S}_k} \rho_n)^2$ does not affect the bound. Since d) is the term that takes the
output measure of k-means to the empirical measure, this implies that the rate
of convergence of k-means cannot be worse than that of $\rho_n \to \rho$. Conversely,
bounds for $\rho_n \to \rho$ are obtained from best rates of convergence of optimal
quantizers, whose convergence to $\rho$ cannot be slower than that of k-means (since
the quantizers that k-means produces are suboptimal).

Since the bounds obtained for the convergence of $\rho_n \to \rho$ are the same as
those for k-means with $k$ of order $k = \Theta(n^{d/(2d+4)})$, this implies that estimates
of $\rho$ that are as accurate as those derived from an $n$ point-mass measure $\rho_n$ can
be derived from $k$ point-mass measures with $k \ll n$.
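To get a feel for how much smaller the support size is than the sample size, the orders of magnitude in theorem 5.2 can be tabulated directly. The snippet below is a rough illustration only: it drops the constant $C$, the factor $m(\rho_A)$ and the confidence term, none of which are specified here, and the function names are hypothetical.

```python
def kmeans_support_order(n, d):
    """Order of the support size, n^{d/(2d+4)}, with all constants dropped."""
    return n ** (d / (2 * d + 4))

def rate_order(n, d):
    """Order of the W_2 bound, n^{-1/(2d+4)}, with all constants dropped."""
    return n ** (-1 / (2 * d + 4))

n = 10 ** 6
for d in (1, 2, 5, 10):
    k = kmeans_support_order(n, d)
    print(f"d={d:2d}  k ~ {k:8.1f}  rate ~ {rate_order(n, d):.3f}  k/n ~ {k / n:.2e}")
```

Since the exponent $d/(2d+4)$ is always below $1/2$, the support size in this regime grows more slowly than $\sqrt{n}$ for every dimension $d$.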

Finally, we note that the introduced bounds are currently limited by the
statistical bound

$$\sup_{|S| = k} \left| W_2(\pi_S \rho_n, \rho_n)^2 - W_2(\pi_S \rho, \rho)^2 \right| \underset{\text{lemma 3.1}}{=} \sup_{|S| = k} \left| \mathbb{E}_{x \sim \rho_n} \, d(x, S)^2 - \mathbb{E}_{x \sim \rho} \, d(x, S)^2 \right|$$

(see for instance [21]), for which non-matching lower bounds are known. This
means that, if better upper bounds can be obtained, then both bounds in
theorems 5.1 and 5.2 would automatically improve.